# README file for "Online Social Network Effects in Labor Markets: Evidence From Facebook’s Entry to College Campuses".

## Replication Instructions:
- Export the replication files to a local subdirectory on your computer. The pipeline should be run in Python, from the project directory.
- Change the ```statapath``` variable in ```constants.yaml``` to the location of STATA 17 on your computer.
- The tar file ```IPEDS.tar``` has all the cleaned IPEDS files used for the paper; it should be extracted into ```data/input/```.
- ```code/run_all.py``` runs the entire paper pipeline, using both STATA and Python. Run it by entering the following in a terminal: ```python3 code/run_all.py```.
- I include all intermediate datasets in ```data/output/```, so the user can simply run ```code/analysis/create_exhibits.do``` if the user is comfortable replicating with my pre-created intermediate data files.
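
The extraction step above can be scripted. The following is a minimal sketch, not part of the replication code: it assumes ```IPEDS.tar``` sits in the project directory (its location is not stated here), and the function name ```extract_ipeds``` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_ipeds(tar_path="IPEDS.tar", dest="data/input"):
    """Extract the cleaned IPEDS archive into the input data directory.

    Both default paths are assumptions; adjust them to your local layout.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest_dir)
    # Return the top-level names now present in the destination for inspection.
    return sorted(p.name for p in dest_dir.iterdir())
```

Calling ```extract_ipeds()``` from the project directory should leave the cleaned IPEDS files under ```data/input/```.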


## Required packages
- STATA 17:
    - ```ssc install postrcspline```
    - ```ssc install did_multiplegt```
    - ```ssc install reghdfe```
    - ```ssc install ivreghdfe```
    - ```ssc install ivreg2```
- Python3:
    - igraph
    - pandas
    - numpy
    - yaml (provided by the PyYAML package)
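
Before running the pipeline, it may help to confirm that the Python modules listed above can be imported. This is a minimal sketch using only the standard library; the helper name ```missing_modules``` is hypothetical and not part of the replication code.

```python
from importlib.util import find_spec

# Import names of the Python packages listed in this README.
REQUIRED_MODULES = ["igraph", "pandas", "numpy", "yaml"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]
```

An empty list from ```missing_modules()``` suggests the Python side is ready; the STATA packages still need to be installed from within STATA.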

## Directories:
- ado: holds any ado files for the STATA analysis. Only contains the ```david4``` graphing scheme used to generate tables.
- code: code base for cleaning data (in clean/ subdirectory) and creating tables/figures (in analysis/ subdirectory)
- data: the directory holding the raw input datasets (in input/ subdirectory) and the output datasets created for running the actual analysis (in output/ subdirectory)
- output: directory to hold any output files, such as tables and figures.


## Code:
### code/clean/ Directory
This folder contains all scripts used to create intermediate datasets
- ```ipeds/```: This sub-directory holds the pre-processing scripts for data downloaded from IPEDS. These are _not_ run as part of the pre-processing pipeline, and the output files are included in ```data/output/``` upon download. However, we include the replication scripts if one wishes to download the raw IPEDS files from the "Complete Data Files" section of the [IPEDS website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx).
   	- ```clean_fall_enrollment.do```: This script takes the raw IPEDS fall enrollment survey files located in ```data/input/Fall_Enrollment/``` and creates the ```data/output/fall_enrollment.dta``` file.
   	- ```clean_ic.do```: This script takes the raw IPEDS institutional characteristics survey files located in ```data/input/Institutional_Characteristics/``` and creates the ```data/output/ic.dta``` file.
   	- ```clean_grad_rates.do```: This script takes the raw IPEDS graduation rates survey files located in ```data/input/Grad_Rates/``` and creates the ```data/output/grad_rates.dta``` file.
   	- ```import_completions_data.py```: Due to the large size of the files, we do not include the raw IPEDS completions (degrees attained) files. Instead, this script downloads and cleans them, outputting the following files, aggregated at the 2- and 4-digit CIP level: ```data/output/cip2_completions.csv```, ```data/output/cip4_completions.csv```.
- ```clean_files.py```: This script runs all the cleaning scripts in the appropriate order.
- ```clean_freshmen_demos.py```: This file creates the racial/gender demographic variables used as controls in main analysis.
- ```create_FB_graph_data.py```: This file uses the igraph python package to process the graph FB data and create a user-level dataset containing network structure.
- ```create_equopp_panel.py```: This file imports, cleans, and creates the earnings measures used for the main analysis panel. It also imports institutional characteristics data, and the demographics of each freshmen cohort, from IPEDS. I also create the exposure measure of FB dates in this script.
- ```create_cohort_graph_data.do```: This file creates a cohort (e.g. Harvard class of 2004) dataset of network statistics from the user-level dataset created in ```create_FB_graph_data.py```.
- ```create_selectivity_measure_NCES.do```: This file creates the selectivity measure of schools used in the NCES' Powerstats platform.
- ```create_gradAdj_exposure.do```: This file creates a measure of exposure to Facebook that adjusts for how long students in each entering class take to complete their degrees/stay in college.
- ```create_ageAdj_earnings.py```: This file creates the measures of earnings that adjust for the different ages students in each cohort enter college for the first time. Also creates Figures A4 and A5.
- ```linkedin_sorting.py```: This file creates the outcome measures used to measure employer concentration on LinkedIn (used for Table 6).
- ```linkedin_matchval.py```: This file creates the outcome measures capturing what LinkedIn alumni from each cohort do on LinkedIn, and matches them to the measures of the % completing each type of degree (used for Table 7).
- ```prepare_analysis_data.do```: This file takes IPEDS data and combines it with EqualOpp data, to create the main dataset used for actual analysis.

### code/analysis/ Directory
This folder contains all scripts used to create Tables and Figures for paper.
- ```create_exhibits.do```: this file creates all figures/tables used in the paper.
- ```summstats.do```: Creates Table 1.
- ```plot_rollout.do```: Creates Figure 1.
- ```fb_usage_plots.do```: Creates the figures associated with Appendix A.
- ```network_regs.do```: Creates Table 2 of paper.
- ```main_analysis.do```: Creates Figure 2, Tables 3-6, and all remaining appendix tables referenced in the main text of the paper, in addition to the placebo regression tables associated with Appendix F.
- ```linkedin_sorting.do```: Creates Table 7.
- ```linkedin_matchval.do```: Creates Table 8 and Appendix Figure A6.
- ```measerr_analysis.do```: Creates the table associated with Appendix D.
- ```twowayFE.do```: Creates the table associated with Appendix E.
- ```iv_regs.do```: Creates the table associated with Appendix G, and Figure A7.
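
To run a single do-file such as ```create_exhibits.do``` without the full pipeline, STATA can be called in batch mode from Python. The sketch below is an assumption about one way to do this, not a description of how ```run_all.py``` works: the function names are hypothetical, the ```statapath``` argument is expected to match the value set in ```constants.yaml```, and the batch flags (```-b``` on Unix/macOS, ```/e``` on Windows) should be checked against your STATA installation.

```python
import subprocess


def stata_batch_command(statapath, do_file, windows=False):
    """Build the argument list for running a do-file in STATA batch mode."""
    # Assumed flags: "-b" for Unix/macOS console STATA, "/e" for Windows.
    batch_flag = "/e" if windows else "-b"
    return [statapath, batch_flag, "do", do_file]


def run_do_file(statapath, do_file, windows=False):
    """Run the do-file and raise an error if STATA exits with a failure code."""
    subprocess.run(stata_batch_command(statapath, do_file, windows), check=True)
```

For example, ```run_do_file(statapath, "code/analysis/create_exhibits.do")``` would be called from the project directory, with ```statapath``` pointing to your STATA 17 executable.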
## Datasets:
### data/input/ directory
This folder contains all raw input datasets.
- ```Fall_Enrollment/```: This sub-directory has the data files for IPEDS fall enrollment surveys in our sample period. To replicate, one downloads the csv files from the complete data files section, and then runs the STATA program from the [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) associated with each CSV file.
- ```Grad_Rates/```: This sub-directory has the data files for IPEDS graduation rate surveys in our sample period. To replicate, one downloads the csv files from the complete data files section, and then runs the STATA program from the [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) associated with each CSV file.
- ```Institutional_Characteristics/```: This sub-directory has the data files for IPEDS institutional characteristics surveys in our sample period. To replicate, one downloads the csv files from the complete data files section, and then runs the STATA program from the [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) associated with each CSV file.
- ```bea_regions.csv```: contains mapping of BEA regions to State FIPS codes.
- ```CIP2_degreetypes.csv```: This maps the 2-digit CIP codes to the aggregate majors used by the [CollegeBoard](https://bigfuture.collegeboard.org/majors-careers)
- ```Crosswalk2000to2010.csv```: This maps 2-digit 2000 CIP codes to the revised 2010 CIP codes.
- ```EF2005A.zip```: Fall enrollment dataset for Fall 2005, downloaded from the IPEDS complete data files. this used
- ```EFlines_merge.csv```: This is used only for cleaning the fall enrollment data. It aggregates categories of enrollment across years to a common set.
- ```facebook100.zip```: This zip-file has the graph-level dataset from the Facebook network in September 2005. Downloaded from the [Internet Archive](https://web.archive.org/web/20050822175050/http://www.thefacebook.com:80/index.php?showall=1).
- ```FB100_university_start_dates_and_metadata_details_Aug_2005.csv```: Contains data on when school year started for first 100 FB schools, from [Johan Ugander's Website](https://azjacobs.com/fb100/)
- ```FB100_university_start_dates_and_metadata_details_Feb_2004.csv```: Contains data on when FB was released to campus at first 100 schools, from [Johan Ugander's Website](https://azjacobs.com/fb100/)
- ```FB_introduction_dates_augmented.csv```: A dataset containing the release dates of Facebook to schools, imputed from crawls of the Wayback Machine on the facebook website up to August 2005. Manually created by Luis Armona and Luca Braghieri.
- ```IPEDSlinks.csv```: This is used only for downloading the completions data from IPEDS. It contains links to directly download IPEDS data. If you use this for other papers/projects, please cite this paper.
- ```linkedin_cip4_xwalk.csv```: a manually created cross-walk from the LinkedIn career/occupation categories to 4-digit CIP codes.
- ```mrc_table3.csv```: The baseline longitudinal, cohort level, panel earnings data from the mobility report cards dataset available on the [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```mrc_table5.csv```: Alternative earnings measures data from the mobility report cards dataset available on the [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```table6_income_levels_by_cohort_parpctile.dta```: contains data on income levels by parent percentile, from [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```mrc_table8.csv```: mapping from earnings percentiles to income levels, from the mobility report cards dataset available on the [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```mrc_table10.csv```: Data on college-level characteristics, such as barrons selectivity tier, from the mobility report cards dataset available on the [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```mrc_table11.csv```: Crosswalk from the mobility report cards super-opeid to regular opeids used by NCES. Available on the [OI website](https://opportunityinsights.org/data/). See downloadable READMEs from website for variable definitions.
- ```PowerStats_BPS_SELECTIVITY_CONTROL_CLEAN.csv```: Data downloaded from the Powerstats NCES platform on age of entry of freshmen, by NCES selectivity tier and public/private status.
- ```PowerStats_DROPOUT_SELECTIVITY_CONTROL_CLEAN.csv```: Data downloaded from the Powerstats NCES platform on when freshmen drop out of school, by NCES selectivity tier and public/private status.
- ```PowerStats_TRANSFER_SELECTIVITY_CONTROL_CLEAN.csv```: Data downloaded from the Powerstats NCES platform on when freshmen transfer out of school, by NCES selectivity tier and public/private status.
- ```sat_act_comp_xwalk.csv```: Mapping of ACT composite scores to SAT composite scores, from [Dorans 1999](https://www.ets.org/Media/Research/pdf/RR-99-02-Dorans.pdf).
- ```sat_act_xwalk.csv```: Mapping of ACT Math scores to SAT Math scores, from [Dorans 1999](https://www.ets.org/Media/Research/pdf/RR-99-02-Dorans.pdf).

### data/output/ directory
This folder contains all intermediate datasets used for analysis.
- ```analysis_sample.dta```: This is the main dataset used to create / run analysis for the paper.
- ```linkedin_dodf.tsv```: This dataset contains the analysis sample used for establishing assortative matching improvements between major and career from LinkedIn. Used to create Table 7.
- ```linkedin_sortDF.csv```: This dataset contains the employer concentration measures for each cohort, derived from LinkedIn data.
- ```ageadj_earn_panel.tsv```: contains weighted averages of earnings measures from ```cohort_earn_panel.csv```, which are calculated by taking distribution of age at entry from BPS NCES survey.
- ```selectivity_nces.dta```: Construction of the NCES' selectivity tier measure for each college in our sample, following the process of [Cunningham 2005](https://files.eric.ed.gov/fulltext/ED488960.pdf).
- ```cohort_graphdata.csv```: A cohort-level dataset of 2005 FB network statistics, derived from ```node_df.csv```.
- ```cohort_earn_panel.csv```: This file incorporates data on institutional characteristics and fall enrollment, from IPEDS, into a cohort-level panel dataset of earnings in 2014, derived from the Mobility Report Card data. It also contains our main treatment variable, EXPOSURE_4YR, the amount of time a student entering in a particular year would have been exposed to Facebook during college, assuming a 4-year completion time.
- ```facebook100/```: Empty directory used to process the 2005 network data.
- ```demos.csv```: A cohort-level dataset containing gender/ethnicity composition, from IPEDS fall enrollment survey.
- ```gradadj_exposure.dta```: A cohort-level dataset containing measures of FB exposure of each freshmen cohort, adjusted for differential time spent in college.
- ```ic.dta```: A panel dataset of college characteristics, from IPEDS Institutional Characteristics annual surveys. See data dictionary downloadable from [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) for variable definitions.
- ```node_df.csv```: An individual-level dataset of each college student's network data. Students not on the Facebook network are imputed as zero-degree nodes. For students I observe on the platform, I capture their degree (# friends), friends with those who would have graduated by the time they enroll (alumni degree), friends with those overlapping in college (peerdegree), and those in the same cohort/class (cohortdegree), along with demographic data described in the ```facebook100.zip``` file.
- ```cip4_completions.csv```: Contains the # of bachelor degrees at the 4-digit CIP code level. Data are organized at the year-UNITID level. CIP_X denotes the number of degrees for CIP code X, all_degrees denotes the total number of degrees, and all_ug_degrees the total number of undergrad degrees.
- ```cip2_completions.csv```: Contains the # of bachelor degrees at the 2-digit CIP code level, for each year of completions issued by each college. Data are organized at the CIPCODE-year-UNITID level. TOTAL denotes the # of degrees in each category.
- ```grad_rates.dta```: A panel dataset of the graduation rates for each entering class, from IPEDS Graduation Rates annual surveys. See data dictionary downloadable from [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) for variable definitions.
- ```fall_enrollment.dta```: A panel dataset of enrollment counts each year, from IPEDS Fall Enrollment annual surveys. See data dictionary downloadable from [NCES website](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx) for variable definitions.
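
To illustrate how a user-level file like ```node_df.csv``` can be collapsed into cohort-level network statistics, here is a minimal Python sketch. It is not the code in ```create_cohort_graph_data.do```: the row keys ```cohort```, ```degree```, and ```on_fb``` and the function name are hypothetical, and the only logic taken from this README is that students without an account count as zero-degree nodes.

```python
from collections import defaultdict
from statistics import mean


def cohort_network_stats(nodes):
    """Collapse student-level rows into cohort-level network statistics.

    Each row is a dict with hypothetical keys: "cohort", "degree", "on_fb".
    """
    by_cohort = defaultdict(list)
    for node in nodes:
        by_cohort[node["cohort"]].append(node)

    stats = {}
    for cohort, members in by_cohort.items():
        user_degrees = [m["degree"] for m in members if m["on_fb"]]
        stats[cohort] = {
            # Fraction of the cohort with an account.
            "on_fb": len(user_degrees) / len(members),
            # Average number of friends among account holders only.
            "degree": mean(user_degrees) if user_degrees else None,
            # Unconditional average: students without an account count as 0 friends.
            "avg_degree": sum(user_degrees) / len(members),
        }
    return stats
```

Under these definitions the unconditional average equals the share on Facebook times the conditional average, which is a useful sanity check on any cohort-level file.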

### Key Variables in Analysis Data
Brief description of key cohort-level variables used to estimate figures and tables.
- ```UNITID```: school ID issued by IPEDS
- ```super_opeid```: school ID to identify Mobility Report Card cohorts
- ```AY_FALL```: the year the cohort began college.
- ```DateJoinedFB```: Date FB released to campus.
- ```FBName```: Name of School according to Facebook Website.
- ```FBIndex```: The rank of school Facebook was released to (e.g. first school that got Facebook has FBIndex=1)
- ```EXPOSURE_4YR```: the access time of each cohort to Facebook during college, assuming a four-year completion time
- ```EXPOSURE_wadj```: the access time of each cohort to Facebook during college, weighted by dropout/transfer/completion times.
- ```degree```: the average number of Facebook friends, among those with FB accounts, for each cohort as of 09/2005.
- ```on_fb```: the fraction of cohort with FB account as of 09/2005
- ```avg_degree```: the unconditional average number of Facebook friends for each cohort as of 09/2005 (counts no FB account as 0 friends).
- ```tiershock```: year x selectivity tier categorical
- ```k_earn```: mean earnings rank of cohort.
- ```k_emprate```: mean employment rate of cohort
- ```k_rank_nozero```: mean earnings rank, excluding 0 income students, of cohort.
- ```k_rank_sd```: estimated std. dev. of earnings rank (see Appendix B)
- ```main_sample```: flag for being in main sample (one of first 760 schools)
- ```late_adopter```: flag for being a school that received access after August 2005 (not in main sample)
- ```twfe_sample```: flag for being in sample used to estimate diff-in-diff robust to treatment heterogeneity (all schools, only up to the class of 2001, as starting in 2002 all schools have access)
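
As an illustration of the idea behind ```EXPOSURE_4YR```, the sketch below computes the overlap between a cohort's assumed four-year enrollment window and the period in which Facebook was available on campus. It is not the construction used in ```create_equopp_panel.py```: the September 1 start date, the June 1 end date, the unit (years), and the function name are all assumptions made for the example.

```python
from datetime import date


def exposure_4yr(ay_fall, date_joined_fb):
    """Years of Facebook access during an assumed four-year enrollment window."""
    start = date(ay_fall, 9, 1)      # assumed start of freshman year
    end = date(ay_fall + 4, 6, 1)    # assumed graduation date after four years
    if date_joined_fb >= end:
        # Facebook arrived after the cohort's assumed graduation: no exposure.
        return 0.0
    access_start = max(start, date_joined_fb)
    return (end - access_start).days / 365.25
```

Under these assumptions, a cohort that enters after Facebook reaches its campus gets the full window, while a cohort that graduates before the release date gets zero.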
